143 research outputs found
Many-Core CPUs Can Deliver Scalable Performance to Stochastic Simulations of Large-Scale Biochemical Reaction Networks
Stochastic simulation of large-scale biochemical reaction networks is becoming essential for Systems Biology. It enables the in-silico investigation of complex biological system dynamics under different conditions and intervention strategies, while also taking into account the inherent "biological noise" that is especially present in the low species count regime. It is, however, a great computational challenge, since in practice we need to execute many repetitions of a complex simulation model to assess the average and extreme-case behavior of the dynamical system it represents. The problem's work scales quickly with both the number of repetitions required and the number of reactions in the bio-model. The worst-case scenario is when there is a need to run thousands of repetitions of a complex model with thousands of reactions. We have developed a stochastic simulation software framework for many- and multi-core CPUs. It is evaluated using Intel's experimental many-core Single-chip Cloud Computer (SCC) CPU and the latest-generation consumer-grade Core i7 multi-core Intel CPU, running Gillespie's First Reaction Method exact stochastic simulation algorithm. It is shown that emerging many-core NoC processors can provide scalable performance, achieving linear speedup as simulation work scales in both dimensions.
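As a rough illustration of the algorithm named in this abstract, the sketch below implements one trajectory of Gillespie's First Reaction Method and repeats it with independent random streams. It is a minimal single-threaded sketch, not the authors' framework: the function names, the reversible A <-> B toy model, and the rate constants are all assumptions made for the example.

```python
import math
import random


def first_reaction_method(x0, stoich, propensities, t_end, rng):
    """Simulate one trajectory and return the species counts at t_end.

    x0           -- initial species counts
    stoich       -- stoich[j] is the state change applied when reaction j fires
    propensities -- propensities[j](x) returns the propensity of reaction j
    """
    x = list(x0)
    t = 0.0
    while True:
        # Draw a tentative firing time for every reaction that can fire.
        best_tau, best_j = math.inf, None
        for j, propensity in enumerate(propensities):
            a = propensity(x)
            if a > 0.0:
                # Exponential sample with rate a; 1 - U lies in (0, 1].
                tau = -math.log(1.0 - rng.random()) / a
                if tau < best_tau:
                    best_tau, best_j = tau, j
        # Stop if nothing can fire or the next event lies past the horizon.
        if best_j is None or t + best_tau > t_end:
            return x
        # Fire the reaction with the smallest tentative time.
        t += best_tau
        x = [xi + d for xi, d in zip(x, stoich[best_j])]


def run_repetitions(n_runs, seed=0):
    """Repeat a toy A <-> B model (hypothetical rates) n_runs times."""
    stoich = [(-1, +1), (+1, -1)]
    propensities = [lambda x: 1.0 * x[0], lambda x: 0.5 * x[1]]
    results = []
    for r in range(n_runs):
        rng = random.Random(seed + r)  # one independent stream per repetition
        results.append(
            first_reaction_method([100, 0], stoich, propensities, 5.0, rng)
        )
    return results
```

The cost of each step grows with the number of reactions, and the total cost grows with the number of repetitions, which are the two scaling dimensions the abstract refers to. Because repetitions share no state, they are natural units of work to distribute across cores.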
Harnessing Performance Variability in Embedded and High-performance Many/Multi-core Platforms
This book describes the state of the art of industrial and academic research in the architectural design of heterogeneous, multi/many-core processors. The authors describe methods and tools that enable next-generation embedded and high-performance heterogeneous processors to cost-effectively confront the inevitable variations by providing Dependable-Performance: correct functionality and timing guarantees throughout the expected lifetime of a platform under thermal, power, and energy constraints. Various aspects of the reliability problem are discussed at both the circuit and architecture level, along with the intelligent selection of knobs and monitors in multicore platforms and systematic design methodologies. The authors demonstrate how new techniques have been applied in real case studies from different application domains and report on the results and conclusions of those experiments.
Enables readers to develop performance-dependable heterogeneous multi/many-core architectures
Describes system software designs that support high performance dependability requirements
Discusses and analyzes low-level methodologies to trade off conflicting metrics, i.e. power, performance, reliability and thermal management
Includes new application design guidelines to improve performance dependability
The Unexpected Efficiency of Bin Packing Algorithms for Dynamic Storage Allocation in the Wild: An Intellectual Abstract
Recent work has shown that viewing allocators as black-box 2DBP solvers bears
meaning. For instance, there exists a 2DBP-based fragmentation metric which
often correlates monotonically with maximum resident set size (RSS). Given the
field's indeterminacy with respect to fragmentation definitions, as well as the
immense value of physical memory savings, we are motivated to set
allocator-generated placements against their 2DBP-devised, makespan-optimizing
counterparts. Of course, allocators must operate online while 2DBP algorithms
work on complete request traces; but since both sides optimize criteria related
to minimizing memory wastage, the idea of studying their relationship preserves
its intellectual--and practical--interest.
Unfortunately no implementations of 2DBP algorithms for DSA are available.
This paper presents a first, though partial, implementation of the
state-of-the-art. We validate its functionality by comparing its outputs'
makespan to the theoretical upper bound provided by the original authors. Along
the way, we identify and document key details to assist analogous future
efforts.
Our experiments comprise 4 modern allocators and 8 real application
workloads. We make several notable observations on our empirical evidence: in
terms of makespan, allocators outperform Robson's worst-case lower bound
of the time. In of cases, GNU's malloc
implementation demonstrates equivalent or superior performance to the 2DBP
state-of-the-art, despite the latter operating offline.
Most surprisingly, the 2DBP algorithm proves competent in terms of
fragmentation, producing up to x better solutions. Future research can
leverage such insights towards memory-targeting optimizations.
Comment: 13 pages, 10 figures, 3 tables. To appear in ISMM '2
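To make the comparison criterion in this abstract concrete, the sketch below checks a placement for validity and computes its makespan alongside the peak live load. It assumes the usual dynamic storage allocation formulation, in which each request is a rectangle in the time/address plane and makespan is the highest address any block reaches. The `Block` type and all function names are hypothetical, and this is not the paper's implementation or its fragmentation metric.

```python
from typing import NamedTuple


class Block(NamedTuple):
    start: int   # allocation time (inclusive)
    end: int     # free time (exclusive)
    size: int    # requested bytes
    offset: int  # address chosen by the allocator or the offline solver


def overlaps(a, b):
    """True if two blocks are live at the same time and share addresses."""
    in_time = a.start < b.end and b.start < a.end
    in_space = a.offset < b.offset + b.size and b.offset < a.offset + a.size
    return in_time and in_space


def is_valid(blocks):
    """A placement is valid when no pair of blocks overlaps."""
    return not any(
        overlaps(a, b) for i, a in enumerate(blocks) for b in blocks[i + 1:]
    )


def makespan(blocks):
    """Highest address touched by any block in the placement."""
    return max((b.offset + b.size for b in blocks), default=0)


def max_load(blocks):
    """Peak sum of live sizes: a lower bound on any valid makespan."""
    events = []
    for b in blocks:
        events.append((b.start, b.size))
        events.append((b.end, -b.size))
    # Tuples sort by time, then by delta, so frees precede allocations
    # that happen at the same instant (end times are exclusive).
    events.sort()
    load = peak = 0
    for _, delta in events:
        load += delta
        peak = max(peak, load)
    return peak


tight = [Block(0, 4, 4, 0), Block(2, 6, 2, 4), Block(4, 8, 4, 0)]
loose = [Block(0, 4, 4, 0), Block(2, 6, 2, 4), Block(4, 8, 4, 6)]
```

Both example placements serve the same three requests. `tight` reuses the addresses freed at time 4, so its makespan equals the peak load, while `loose` places the third block above the others and ends with a larger makespan for the same load.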
Adjacent LSTM-Based Page Scheduling for Hybrid DRAM/NVM Memory Systems
Recent advances in memory technologies have led to the rapid growth of hybrid systems that combine traditional DRAM and Non-Volatile Memory (NVM) technologies, as the latter provide lower cost per byte, low leakage power and larger capacities than DRAM, while guaranteeing comparable access latency. Such heterogeneous memory systems impose new challenges in terms of page placement and migration among their alternative technologies. In this paper, we present a novel approach for efficient page placement on heterogeneous DRAM/NVM systems. We design an adjacent LSTM-based approach for page placement, which relies strongly on page-access prediction while sharing knowledge among pages with behavioral similarity. The proposed approach improves performance by up to 65.5% compared to existing approaches, while achieving near-optimal results and saving 20.2% energy consumption on average. Moreover, we propose a new page replacement policy, namely clustered-LRU, which improves performance by up to 8.1% compared to the default Least Recently Used (LRU) policy.
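For reference, the default LRU policy that the abstract uses as its baseline can be sketched as below. This covers plain LRU only: the abstract does not describe how clustered-LRU groups pages, so that policy is not attempted here. The function name and the example trace are assumptions made for illustration.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity):
    """Replay a page-access trace against an LRU-managed fast tier.

    Returns the number of misses and the resident pages ordered from
    least to most recently used.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    resident = OrderedDict()
    misses = 0
    for page in trace:
        if page in resident:
            # Hit: mark the page as most recently used.
            resident.move_to_end(page)
        else:
            misses += 1
            if len(resident) >= capacity:
                # Evict the least recently used page.
                resident.popitem(last=False)
            resident[page] = True
    return misses, list(resident)
```

Replaying the trace `[1, 2, 3, 1, 4, 2]` with a capacity of 3 gives 5 misses and leaves pages 1, 4 and 2 resident: page 2 is evicted to admit page 4, then immediately faulted back in at the cost of page 3.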
Resource Aware GPU Scheduling in Kubernetes Infrastructure
Nowadays, an ever-increasing number of artificial intelligence inference workloads are pushed to and executed on the cloud. To effectively serve and manage the computational demands, data center operators have provisioned their infrastructures with accelerators. For GPUs specifically, support for efficient management is lacking, as state-of-the-art schedulers and orchestrators treat GPUs only as typical compute resources, ignoring their unique characteristics and application properties. This phenomenon, combined with the GPU over-provisioning problem, leads to severe resource under-utilization. Even though prior work has addressed this problem by colocating applications on a single accelerator device, its resource-agnostic nature fails to address resource under-utilization and quality-of-service violations, especially for latency-critical applications.
In this paper, we design a resource-aware GPU scheduling framework that is able to efficiently colocate applications on the same GPU accelerator card. We integrate our solution with Kubernetes, one of the most widely used cloud orchestration frameworks. We show that our scheduler can achieve 58.8% lower 99th-percentile end-to-end job execution time, while delivering 52.5% higher GPU memory usage, 105.9% higher GPU utilization percentage on average and 44.4% lower energy consumption on average, compared to the state-of-the-art schedulers, for a variety of representative ML workloads.
EDEN: A high-performance, general-purpose, NeuroML-based neural simulator
Modern neuroscience employs in silico experimentation on ever-increasing and
more detailed neural networks. The high modelling detail goes hand in hand with
the need for high model reproducibility, reusability and transparency. Besides,
the size of the models and the long timescales under study mandate the use of a
simulation system with high computational performance, so as to provide an
acceptable time to result. In this work, we present EDEN (Extensible Dynamics
Engine for Networks), a new general-purpose, NeuroML-based neural simulator
that achieves both high model flexibility and high computational performance,
through an innovative model-analysis and code-generation technique. The
simulator runs NeuroML v2 models directly, eliminating the need for users to
learn yet another simulator-specific, model-specification language. EDEN's
functional correctness and computational performance were assessed through
NeuroML models available on the NeuroML-DB and Open Source Brain model
repositories. In qualitative experiments, the results produced by EDEN were
verified against the established NEURON simulator, for a wide range of models.
At the same time, computational-performance benchmarks reveal that EDEN runs up
to 2 orders-of-magnitude faster than NEURON on a typical desktop computer, and
does so without additional effort from the user. Finally, and without added
user effort, EDEN has been built from scratch to scale seamlessly over multiple
CPUs and across computer clusters, when available.
Comment: 29 pages, 9 figures
BrainFrame: A node-level heterogeneous accelerator platform for neuron simulations
Objective: The advent of High-Performance Computing (HPC) in recent years has
led to its increasing use in brain study through computational models. The
scale and complexity of such models are constantly increasing, leading to
challenging computational requirements. Even though modern HPC platforms can
often deal with such challenges, the vast diversity of the modeling field does
not permit a single acceleration (or homogeneous) platform to effectively
address the complete array of modeling requirements. Approach: In this paper we
propose and build BrainFrame, a heterogeneous acceleration platform,
incorporating three distinct acceleration technologies, a Dataflow Engine, a
Xeon Phi and a GP-GPU. The PyNN framework is also integrated into the platform.
As a challenging proof of concept, we analyze the performance of BrainFrame on
different instances of a state-of-the-art neuron model, modeling the Inferior-
Olivary Nucleus using a biophysically-meaningful, extended Hodgkin-Huxley
representation. The model instances take into account not only the neuronal-
network dimensions but also different network-connectivity circumstances that
can drastically change application workload characteristics. Main results: The
synthetic approach of three HPC technologies demonstrated that BrainFrame is
better able to cope with the modeling diversity encountered. Our performance
analysis shows clearly that the model directly affects performance and that all three
technologies are required to cope with all the model use cases.
Comment: 16 pages, 18 figures, 5 tables
Co-Design of Approximate Multilayer Perceptron for Ultra-Resource Constrained Printed Circuits
Printed Electronics (PE) exhibits on-demand, extremely low-cost hardware due to its additive manufacturing process, enabling machine learning (ML) applications for domains that feature ultra-low cost, conformity, and non-toxicity requirements that silicon-based systems cannot deliver. Nevertheless, large feature sizes in PE prohibit the realization of complex printed ML circuits. In this work, we present, for the first time, an automated printed-aware software/hardware co-design framework that exploits approximate computing principles to enable ultra-resource constrained printed multilayer perceptrons (MLPs). Our evaluation demonstrates that, compared to the state-of-the-art baseline, our circuits feature on average 6x (5.7x) lower area (power) and less than 1% accuracy loss
- …